DOMAIN: Automobile
CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes
PROJECT OBJECTIVE: Cluster the data, then treat each cluster as an individual dataset and train regression models to predict 'mpg'
Import all the given datasets and explore shape and size.
• Merge all datasets into one and explore the final shape and size.
• Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use.
• Import the data from above steps into python.
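The export/import round-trip in the steps above can be sketched as follows (a minimal sketch with placeholder data and file names; the Excel line needs the `openpyxl` package and is left commented):

```python
import pandas as pd

# Tiny stand-in for the merged dataset (placeholder values, not the real file)
df = pd.DataFrame({"mpg": [18.0, 15.0], "cyl": [8, 8]})

# Export in the three requested formats
df.to_csv("automobile.csv", index=False)
df.to_json("automobile.json", orient="records")
# df.to_excel("automobile.xlsx", index=False)  # requires openpyxl

# Re-import and confirm the CSV round-trip preserved the shape
reloaded = pd.read_csv("automobile.csv")
print(reloaded.shape)  # (2, 2)
```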
#import libraries that will be used for EDA
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import average_precision_score, confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import PCA
#Load the data from Json file
import json
P1 = json.load(open('Part1 - Car-Attributes.json'))
P1_Data = pd.DataFrame(P1)
P1_Data
P1_names = pd.read_csv('Part1 - Car name.csv')
P1_names
P1_Data = pd.concat([P1_Data, P1_names], axis = 1)
P1_Data.shape
#So now we have 398 observations with 9 different features
P1_Data.to_csv("Part 1 - Automobile.csv" , index = False)
Automobile = pd.read_csv('Part 1 - Automobile.csv')
Automobile.shape
• Missing/incorrect value treatment
• Drop attribute/s if required using relevant functional knowledge
• Perform another kind of corrections/treatment on the data.
#Let's have a first look at the data
Automobile.head()
Automobile.info()
#All columns are non-null, but hp, which should be numeric, is shown as object data type;
#we need to get insights into the hp data
Automobile.describe(include = "all").T
#Clearly there is some non-numeric data which needs to be taken care of
# isdigit()? on 'hp'
hpIsDigit = pd.DataFrame(Automobile.hp.str.isdigit()) # if the string is made of digits store True, else False
#print the rows where isdigit() is False
Automobile[hpIsDigit['hp'] == False]
# Missing values are marked with a '?'
# Replace missing values with NaN
Automobile = Automobile.replace('?', np.nan)
Automobile[hpIsDigit['hp'] == False]
Automobile['hp'] = pd.to_numeric(Automobile['hp'])  # convert hp from object/string to numeric (NaN values are preserved)
Automobile['hp'].median() # note the median of the hp column
Automobile.isnull().any() # check which features contain null values
# Since hp is the only column with null values, we can fill its NA values with the median of hp
Automobile['hp'] = Automobile['hp'].fillna(Automobile['hp'].median())
Automobile[hpIsDigit['hp'] == False]
Automobile.isnull().any() # recheck for NULL values in any column
#Looking at the data, we can say that the car name is not useful for achieving our objective,
#hence we can drop the car name column
Automobile = Automobile.drop('car_name', axis = 1)
Automobile.shape #confirmation on car_name column is dropped
#Since we have analysed and understood the data distribution,
#let's logically transform the data for better interpretability.
#The year column can be replaced with the age of the car, which is easier to reason about.
Automobile['age'] = 83 - Automobile['yr']  # assuming the latest model year in the data is 82
Automobile['age'].describe()
# we can drop the yr column now
Automobile = Automobile.drop('yr', axis = 1)
Automobile.head() #confirm the yr column is dropped
• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
#Lets check the distribution of each feature
Automobile.hist(color='lightblue', edgecolor = 'black', alpha = 0.7, figsize = (20,10), layout=(3,3))
plt.tight_layout()
plt.show()
# We can see that only the acc feature is approximately normally distributed
# For skewness, the closer the value is to 0, the closer the distribution is to normal
#negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure.
#positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure.
Automobile.skew()
#We can infer that the distributions are reasonably symmetric; nothing is extremely skewed except horsepower
#Let's check for the presence of outliers in the data set
for columns in Automobile:
    plt.figure()
    plt.title(columns)
    sns.boxplot(data = Automobile[columns], orient="h")
# We can observe that mileage (mpg), horsepower and acceleration contain outliers
# Outliers will impact clustering models because of their extreme distance from the rest of the distribution
sns.pairplot(Automobile , diag_kind='kde')
#From this single visual representation we can make inferences on univariate and bivariate distributions.
# The pairplot can be used as one place to analyse the entire data distribution and correlation.
# From the above, individual distributions are close to normal for most features,
# except for origin and cylinders.
# Based on the distribution curves, we can probably expect 3 clusters in this data set.
# The correlation of mileage follows a negative slope with displacement, horsepower and weight,
# which means mileage starts reducing as these 3 values increase.
Automobile.corr()
# As per our previous assumption,
# we have negative correlation between mileage and (displacement, horsepower, weight, cylinders)
#Visual representation of above correlation
plt.figure(figsize = (10,5))
sns.heatmap(Automobile.corr(), annot = True)
# From below we can infer that a car's mileage will reduce as the number of cylinders, displacement,
# horsepower and weight of the vehicle increase
#however the positive correlation between year and origin needs to be understood further
#If the objective of the project were only to cluster the data, we might prefer fixing the outliers as below:
#take a logarithmic transform of mileage, horsepower and acceleration to tame the outliers so that the
#impact on centroids is minimized; but since we also need to build models, removing outliers might lead to information loss
#Automobile['hp'] = np.log(Automobile['hp'])
#Automobile['acc'] = np.log(Automobile['acc'])
#Automobile['mpg'] = np.log(Automobile['mpg'])
#Automobile.describe().T
• Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data.
• Share your insights about the difference in using these two methods.
#Before we start creating clusters, lets scale the data
Automobile_sc = Automobile.apply(zscore)
Automobile_sc.head()
# Variables are now scaled. Let us now try to create clusters
#import library for K means
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
cluster_errors = []
silhouette_value = []
for i in range(1,10):
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(Automobile_sc)
    labels = kmeans.labels_
    cluster_errors.append(kmeans.inertia_)
    if i > 1:
        silhouette_value.append(silhouette_score(Automobile_sc, labels))
    else:
        silhouette_value.append(1)
kmean_df = pd.DataFrame({"Number of clusters": range(1,10), "Cluster Inertia": cluster_errors, "Silhouette score": silhouette_value})
kmean_df
#From below it is clearly visible that after 4 clusters the silhouette score bounces back, hence we can opt for 4 clusters
# Another way of finding the ideal K value is the elbow method
plt.plot(range(1,10), cluster_errors, marker = "o")
# Following the elbow method, from the representation below we can see that
# after 3 clusters we don't see a big shift in the error level as we move towards a higher number of clusters.
# Looking at both methods, we can consider K = 4 clusters as per the silhouette observation above
# Let's find which linkage method to choose for hierarchical clustering on the given data set
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
# the cophenetic correlation measures how faithfully the dendrogram preserves the pairwise distances between points;
# the closer it is to 1, the better the clustering
# we will compare the linkage methods "single", "complete", "average" and "ward"
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(Automobile_sc, metric='euclidean', method=method)
    c, coph_dists = cophenet(Z, pdist(Automobile_sc))
    print("Cophenetic correlation for the given data set with the", method, "linkage method is:", c)
#Based on the above, the average linkage method gives the best result, hence we go with that method
# Importing necessary library
from sklearn.cluster import AgglomerativeClustering
HC_errors = []
for i in range(1,10):
    # use the average linkage, which gave the best cophenetic correlation above
    HCmodel = AgglomerativeClustering(n_clusters=i, affinity='euclidean', linkage='average')
    HCmodel.fit(Automobile_sc)
    labels = HCmodel.labels_
    if i > 1:
        HC_errors.append(silhouette_score(Automobile_sc, labels))
    else:
        HC_errors.append(1)
print("Silhouette coefficients for the hierarchical clustering technique with 1 to 9 clusters:\n")
HC_errors
#From below it is clearly visible that after 5 clusters the silhouette score bounces back and increases,
#hence we can opt for 5 clusters based on the silhouette score
Using both the K-means and hierarchical clustering methods, it is a challenge to determine the correct number of clusters for this data set.
From the visual representation (groupings in the pairplot), 3 clusters is what we inferred, but with the K-means silhouette coefficients, 4 clusters is the optimal number.
With the hierarchical clustering technique, the number of clusters is decided either by looking at the dendrogram or from the silhouette coefficients; in our case, 5 clusters.
For this data, let's go ahead with the K-means clusters and make inferences per cluster.
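The dendrogram mentioned above can be drawn with scipy's `dendrogram`; a minimal sketch on random stand-in data (the real notebook would pass the linkage matrix computed on `Automobile_sc`):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Random stand-in for the scaled Automobile data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))

# Average linkage, matching the cophenetic comparison above
Z = linkage(X, method='average', metric='euclidean')

# Truncate to the last 10 merges so the tree stays readable
plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode='lastp', p=10)
plt.title('Hierarchical clustering dendrogram (truncated)')
plt.xlabel('cluster size')
plt.ylabel('distance')
plt.show()
```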
• Mention how many optimal clusters are present in the data and what could be the possible reason behind it.
• Use linear regression model on different clusters separately and print the coefficients of the models individually
• How using different models for different clusters will be helpful in this case and how it will be different than using one single model without
clustering? Mention how it impacts performance and prediction.
# Based on above inference lets create kmean with 4 clusters
kmeans = KMeans(n_clusters = 4)
kmeans.fit(Automobile_sc)
labels = kmeans.labels_
#Adding the labels to our original data set
Automobile['Cluster4'] = labels
Automobile.head()
#Lets check the individual group wise average values for each feature
temp_Automobile = Automobile.groupby(['Cluster4'])
print("Total data in each cluster")
temp_Automobile.count()
#Data in each cluster is fairly evenly distributed
print("Average data in each cluster")
temp_Automobile.mean()
Looking at the group means, we can characterize the clusters:
Cluster 2 (mpg ~19): cars with average age, cylinders and horsepower give moderate mileage
Cluster 1 (mpg ~14): with a higher number of cylinders, mileage is reduced
Cluster 3 (mpg ~33): newer cars give more mileage than older ones
Cluster 0 (mpg ~25): with reduced weight and cylinders, even an older car can be expected to give pretty good mileage at around 25 mpg
We can also infer that cars from origin 2 tend to give more mileage than origin 1 cars
#Adding the labels to the scaled data set as well
Automobile_sc['Cluster4'] = labels
Automobile_sc.head()
plt.scatter(Automobile_sc['Cluster4'],Automobile_sc['mpg'], color = ['Lightgreen'])
#We can observe that all the clusters overlap each other to a considerable extent,
#hence it is justified that it will be difficult for a model to perform with high accuracy
sns.catplot(x='Cluster4', y='mpg', data=Automobile_sc, hue='Cluster4')
#Another way of representing the mpg data across all 4 clusters; we can see that cluster 2 completely overlaps with the other three clusters
#Creating a model on the full scaled dataset
x_sc = Automobile_sc.drop('mpg', axis = 1)
y_sc = Automobile_sc['mpg']
x_train_sc, x_test_sc, y_train_sc, y_test_sc = train_test_split(x_sc, y_sc, test_size=0.30, random_state=1)
#creating linear regression on the entire dataset
from sklearn.linear_model import LinearRegression
LR_model_sc = LinearRegression()
LR_model_sc.fit(x_train_sc, y_train_sc)
print("Performance of Linear regression model on entire data set is:", LR_model_sc.score(x_test_sc,y_test_sc))
print("\nCoefficient of Linear regression model on entire data set is: \n", LR_model_sc.coef_)
# we can see that without tuning, our model achieves an R² score of about 0.86 on the test set;
# now let's find how the model performs on individual clusters
#Create separate data sets based on clusters to run linear regression on each
cl0 = Automobile_sc.loc[Automobile_sc['Cluster4']== 0]
cl1 = Automobile_sc.loc[Automobile_sc['Cluster4']== 1]
cl2 = Automobile_sc.loc[Automobile_sc['Cluster4']== 2]
cl3 = Automobile_sc.loc[Automobile_sc['Cluster4']== 3]
#Create x and y data set for each cluster data
Xcl0 =cl0.drop(['mpg'], axis=1)
ycl0 = cl0[['mpg']]
Xcl1 =cl1.drop(['mpg'], axis=1)
ycl1 = cl1[['mpg']]
Xcl2 =cl2.drop(['mpg'], axis=1)
ycl2 = cl2[['mpg']]
Xcl3 =cl3.drop(['mpg'], axis=1)
ycl3 = cl3[['mpg']]
#Create training and testing data set for each cluster with 70:30 ratio
X_train_cl0, X_test_cl0, y_train_cl0, y_test_cl0 = train_test_split(Xcl0, ycl0, test_size=0.30, random_state=1)
X_train_cl1, X_test_cl1, y_train_cl1, y_test_cl1 = train_test_split(Xcl1, ycl1, test_size=0.30, random_state=1)
X_train_cl2, X_test_cl2, y_train_cl2, y_test_cl2 = train_test_split(Xcl2, ycl2, test_size=0.30, random_state=1)
X_train_cl3, X_test_cl3, y_train_cl3, y_test_cl3 = train_test_split(Xcl3, ycl3, test_size=0.30, random_state=1)
#creating a linear regression model for each cluster
from sklearn.linear_model import LinearRegression
for i, (X_tr, X_te, y_tr, y_te) in enumerate([
        (X_train_cl0, X_test_cl0, y_train_cl0, y_test_cl0),
        (X_train_cl1, X_test_cl1, y_train_cl1, y_test_cl1),
        (X_train_cl2, X_test_cl2, y_train_cl2, y_test_cl2),
        (X_train_cl3, X_test_cl3, y_train_cl3, y_test_cl3)]):
    LR_model = LinearRegression()
    LR_model.fit(X_tr, y_tr)
    print("\nPerformance of the linear regression model on cluster", i, "is:", LR_model.score(X_te, y_te))
    print("Coefficients of the linear regression model on cluster", i, "are:\n", LR_model.coef_)
From the above we can see that model performance differs across clusters; this can be fine-tuned further by using different model types for different clusters.
So when a new car observation comes in, we can identify which cluster it belongs to and accordingly estimate its probable average miles per gallon. For this particular data set, after observing the models' performances, we can use the complete data set to build the model rather than clustering first and building a model per cluster.
Clustering the data is worth opting for when we have a very large data set with well-separable clusters, eventually resulting in better models for predicting the target variable.
For better analysis in the future, the company can transform features like year into age for discrete variables, and where possible a standard one-hot encoding should be followed for better model performance.
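The one-hot encoding suggested above could look like this for the discrete `origin` codes (a sketch with placeholder values):

```python
import pandas as pd

# Placeholder frame with the discrete 'origin' codes used in this dataset
df = pd.DataFrame({'origin': [1, 2, 3, 1]})

# One-hot encode origin so no spurious ordinal relationship is implied
encoded = pd.get_dummies(df, columns=['origin'], prefix='origin')
print(encoded.columns.tolist())  # ['origin_1', 'origin_2', 'origin_3']
```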
DOMAIN: Manufacturing
PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.
Solution approach: To fill in the target column, we can use a machine learning model that learns the pattern in the existing complete rows and accordingly generates synthetic values for the remaining incomplete rows.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
#Import data and take the first look at the data
data = pd.read_excel('Part2 - Company.xlsx')
data.head()
#Check the shape of the data
data.shape
#We can see that we have null values only in Quality field
data.isnull().any()
# We have a total of 18 rows with null values in Quality
data['Quality'].isnull().sum()
#Separating the data into rows with complete and incomplete Quality values
incomplete_data = data[data['Quality'].isnull() == True]
complete_data = data[data['Quality'].isnull() == False]
#Check the index positions of missing values
incomplete_data.index
#Check the index positions of complete data
complete_data.index
# Split into training (complete rows) and test (incomplete rows) sets based on the available data
x_train = complete_data.drop('Quality', axis = 1)
y_train = complete_data['Quality']
x_test = incomplete_data.drop('Quality', axis = 1)
y_test = incomplete_data['Quality']
# Creating a logistic regression model to predict the missing values of Quality
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
print("Synthetic data generated for incomplete data is\n", y_predict)
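Before trusting the generated labels, one way to gauge how well the classifier has learned the pattern is cross-validation on the complete rows. A minimal sketch, using synthetic stand-in data since the company file is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the complete rows (features plus known Quality labels)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# 5-fold CV accuracy estimates how reliable the synthetic labels will be
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```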
# Now that we have Synthetic data generated for missing values, lets complete the data by filling in missing values
pd.set_option('display.max_rows', None)
incomplete_data['Quality'] = y_predict
Newdata = pd.concat([incomplete_data, complete_data])
Newdata.sort_index()
#Let's confirm the missing values again
Newdata['Quality'].isnull().sum()
DOMAIN: Automobile
CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
PROJECT OBJECTIVE: Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import average_precision_score, confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
import warnings
warnings.filterwarnings('ignore')
#Load the data set and take the first glance at the data
cars = pd.read_csv('Part3 - vehicle.csv')
cars.head()
#Check the shape of the data
cars.shape
# we have 846 observations with 18 features plus the target class (car/van/bus)
cars.info()
#We can see that there are many features with null values; all columns are numeric except the target column
cars.isnull().sum()
# we can see below are the features with missing values
#Before fixing the null values, let's check the 5-point summary of the data
cars.describe().T
#we can observe that the mean value of almost all the features is close to the 50% (median) value,
# so we can say the data is well behaved and we can replace null values with the mean of each feature.
# Also, unlike the earlier dataset, we don't have any '?' values in this dataset
#Let's fix the null values first before proceeding further
#Replace blank values with NaN values
cars = cars.replace('', np.nan)
#Replace the Null value with the mean value of each column
for i in cars.columns[:-1]:   # all feature columns (the last column is the target 'class')
    mean = cars[i].mean()
    cars[i] = cars[i].fillna(mean)
#Lets confirm if we have any null values in the data anymore
cars.isnull().sum()
#Lets check the class wise distribution of the data
cars['class'].value_counts()
#Lets check the distribution of each feature
cars.hist(color='lightblue', edgecolor = 'black', alpha = 0.7, figsize = (15,20), layout=(6,3))
plt.tight_layout()
plt.show()
#we can see that the distribution is quite normal for all columns except for max.length_aspect_ratio, pr.axis_aspect_ratio,
#scaled_radius_of_gyration, pr.axis_rectangularity, skewness_about and skewness_about.1
# For Skewness, closer the value to 0, perfectly the distribution follows normal distribution
#negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure.
#positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure.
cars.skew()
#From below we can clearly see that pr.axis_aspect_ratio, max.length_aspect_ratio and scaled_radius_of_gyration.1
#are highly skewed, and the distribution of the other features is quite acceptable
#Let's check for outliers in the dataset
for columns in cars.columns[:-1]:
    plt.figure()
    plt.title(columns)
    sns.boxplot(data = cars[columns], orient="h", color = 'pink')
# we can see that we have outliers in radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance,
# scaled_variance.1, scaled_radius_of_gyration.1 and skewness_about
# Treating outliers is a big decision, as it might cause information loss in data sets with a large number of outliers.
#SVM is not very robust to outliers, and hence the presence of a few outliers can lead to misclassification.
#Hence we choose to get rid of the outliers
# For all the feature columns, find the 1st and 3rd quartile values,
# then calculate cutoff values for each column based on them,
# and replace the outliers with the column median
for columns in cars.columns[:-1]:
    #find 1st and 3rd quartile
    q1 = cars[columns].quantile(0.25)
    q3 = cars[columns].quantile(0.75)
    iqr = q3 - q1
    #outlier cutoffs
    low = q1 - 1.5*iqr
    high = q3 + 1.5*iqr
    #replace outliers with the median of each column
    cars.loc[(cars[columns] < low) | (cars[columns] > high), columns] = cars[columns].median()
#Lets check for the outliers in the dataset again
for columns in cars.columns[:-1]:
    plt.figure()
    plt.title(columns)
    sns.boxplot(data = cars[columns], orient="h", color = 'pink')
# we can observe that we don't have any outliers in the data set anymore
# Lets check the pairplot to see the individual distribution on diagonal plots
# and correlation of all columns among each other on either sides of diagonal plots
sns.pairplot(cars, hue = 'class')
plt.show()
# Checking the correlation in the pairplot, we can observe many positive slopes and a few negative slopes;
# that means many features are positively correlated with each other: if one grows, the other grows as well.
# We can also observe the distribution of each class on the diagonal plots for each feature
#(blue - van, orange - car, green - bus)
#Since the class data is not linearly separable, we cannot have a hard margin when we think of SVM
#for correlation, the closer the value is to 1, the higher the correlation between two features
cars.corr()
# we can observe clear correlations between features; let's visualize using a heatmap and then summarize our observations
#Visualize the above correlation using heat map
plt.figure(figsize = (20,18))
sns.heatmap(cars.corr(), annot = True)
We can see that there is heavy correlation between many features, for example:
circularity is highly correlated with max.length_rectangularity and scaled_radius_of_gyration
distance_circularity with scatter_ratio, axis_rectangularity, scaled_variance and scaled_variance.1
scatter_ratio with axis_rectangularity, scaled_variance and scaled_variance.1
elongatedness is negatively correlated with most of the features
pr.axis_rectangularity with scaled_variance and scaled_variance.1
scaled_variance with scaled_variance.1, and skewness_about.2 with hollows_ratio
For classification models, the more linearly correlated the features, the harder it becomes to make accurate predictions; if two features are highly correlated, there is little point in using both, and we can drop either one of the two.
Since we have many highly collinear features, we need to reduce the number of columns.
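The drop-one-of-two idea above can be mechanized by scanning the upper triangle of the absolute correlation matrix (a sketch on synthetic stand-in columns; the 0.95 threshold is an arbitrary choice, not from the project):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# 'b' is almost a copy of 'a'; 'c' is independent
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.01, size=100),
                   'c': rng.normal(size=100)})

# Absolute correlations; keep only the upper triangle so each pair is seen once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Any column correlated above 0.95 with an earlier column is a drop candidate
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # ['b']
```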
x = cars.drop(['class'] , axis = 1)
y = cars['class']
# Scaling the data set using standardScaler
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_sc,y,test_size=0.30, random_state=1)
# Import necessary libraries
from sklearn.svm import SVC
# Building a Support Vector Machine on train data
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(x_train, y_train)
y_pred = svc_model.predict(x_test)
print ('Accuracy of SVC model without reducing any dimensions is:', accuracy_score(y_test, y_pred)*100 )
print ('classification report of this model is:' )
print(metrics.classification_report(y_test, y_pred))
plot_confusion_matrix(svc_model,x_test,y_test)
# we can see that without reducing dimensions the model accuracy is 90.94%,
# correctly predicting 88%, 92% and 90% of bus, car and van silhouettes respectively.
# For Bus: Model correctly predicted 53 observations as bus, and incorrectly predicted 5 cars and 1 van
# For Car: Model correctly predicted 123 observations as car, and incorrectly predicted 6 bus and 5 van
# For Van: Model correctly predicted 55 observations as van, and incorrectly predicted 2 bus and 5 car
We can reduce dimensions using PCA by following the steps below:
# Calculate the covariance matrix
cov_matrix = np.cov(x_sc.T)
print("cov_matrix shape:",cov_matrix.shape)
print("Covariance_matrix",cov_matrix)
#Calculating Eigen Vectors & Eigen Values:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s', eigenvectors)
print('\n Eigen Values \n%s', eigenvalues)
#Sort eigenvalues in descending order
# Make a set of (eigenvalue, eigenvector) pairs:
eig_pairs = [(eigenvalues[i], eigenvectors[:,i]) for i in range(len(eigenvalues))]
# Sort the (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue
# (sort on the eigenvalue only; comparing the eigenvector arrays would be ambiguous)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print(eig_pairs)
# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]
# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in eigvalues_sorted] # an array of variance explained by each
# eigen vector... there will be 18 entries as there are 18 eigen vectors)
cum_var_exp = np.cumsum(var_explained) # an array of cumulative variance. There will be 18 entries with 18 th entry
# cumulative reaching almost 100%
print('Cumulative Variance explained:\n' , cum_var_exp)
# Plotting the Explained variance and principal components
plt.figure(figsize=(10,5))
plt.bar(range(1,19), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,19),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.show()
# From the plot below we can clearly observe that 10 principal components are able to explain ~98% of the variance of the data,
# so we will use the first 10 principal components going forward and compute the reduced dimensions.
# x_reduce represents the reduced mathematical space
x_reduce = np.array(eigvectors_sorted[0:10]) # Reducing from 18 to 10 dimension space
x_pca_10D = np.dot(x_sc,x_reduce.T) # projecting original data into principal component dimensions
x_reduced_pca = pd.DataFrame(x_pca_10D) # converting array to dataframe for pairplot
x_reduced_pca
# Now that we have our PCA reduced data , we can now train the SVM model and find out how well does it perform
# Split the data based on reduced data set
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(x_reduced_pca,y,test_size=0.30, random_state=1)
# Lets create SVM model with reduced dimensions
# Building a Support Vector Machine on train data
svc_model = SVC(C= .1, kernel='linear', gamma= 1)
svc_model.fit(x_train_pca, y_train_pca)
y_pred = svc_model.predict(x_test_pca)
print ('Accuracy of SVC model with reduced dimensions is:', accuracy_score(y_test_pca, y_pred)*100 )
print ('classification report of this model is:' )
print(metrics.classification_report(y_test_pca, y_pred))
#Plotting the confusion Matrix
plot_confusion_matrix(svc_model,x_test_pca,y_test_pca)
# we can see that after reducing dimensions the model accuracy is 88.97%,
# correctly predicting 90%, 91% and 84% of bus, car and van silhouettes respectively.
# For Bus: Model correctly predicted 53 observations as bus, and incorrectly predicted 5 cars and 1 van
# For Car: Model correctly predicted 121 observations as car, and incorrectly predicted 8 bus and 4 van
# For Van: Model correctly predicted 52 observations as van, and incorrectly predicted 1 bus and 9 car
From the above we can see that with a reduced number of dimensions the model's performance is close to 89%, while without reducing dimensions the accuracy is about 91%. Depending on the project context, we can easily trade this small drop in performance against the time and resources used by the model.
Using PCA we were able to reduce the data from 18 dimensions down to 10. Based on business needs, we can further reduce or increase the number of components by weighing explained variance against model performance.
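For comparison, the manual eigen-decomposition above can be reproduced with `sklearn.decomposition.PCA` (already imported earlier in the notebook); a sketch on random stand-in data, where `n_components=0.98` asks for enough components to explain 98% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 18))      # stand-in for the 18 scaled features
X_sc = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 98% variance
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(X_sc)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```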
DOMAIN: Sports management
CONTEXT: Company X is a sports management company for international cricket.
PROJECT OBJECTIVE: Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn.metrics import average_precision_score, confusion_matrix, accuracy_score, classification_report, plot_confusion_matrix
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
cricket = pd.read_csv('Part4 - batting_bowling_ipl_bat.csv')
cricket.head()
# we can see that there are entire null rows in the data set
cricket.shape
# shape of the dataset
cricket.info()
# clearly we have 90 non-null records in each feature out of a total of 180 records
# looking at the head, we can say that every alternate record is a null record
cricket.isnull().sum()
# we can confirm that there are 90 null records and 90 non-null records by looking at the counts above and below
#let's drop the null records
cricket.dropna(axis=0, inplace = True)
cricket.shape
#rechecking the shape of the data set
cricket.info()
#recheck the info to confirm there are no null values left; clearly all the null records are now dropped
# and the data set is ready for further EDA
cricket.describe().T
# we can see that almost all the columns have outliers, since the max value is much higher than the 75th percentile
#Lets check the distribution of each feature
cricket.hist(color='lightblue', edgecolor = 'black', alpha = 0.7, figsize = (20,10), layout=(3,3))
plt.tight_layout()
plt.show()
#we can see that the distributions are far from normal for most columns
cricket.skew()
# We can see that except for Runs all the other columns are highly skewed
#Let's check for outliers in the dataset
for columns in cricket.columns[1:7]:
    plt.figure()
    plt.title(columns)
    sns.boxplot(data = cricket[columns].values, orient="h", color = 'pink')
# we can see that all columns have outliers; since our objective is to rank the batsmen,
# we will require the extreme values to score the batsmen
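Even though we keep the extreme values, counting how many points fall outside the usual 1.5×IQR whiskers gives a feel for how heavy the tails are. A hedged sketch on a toy column (not the actual cricket data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# 85 typical scores plus 5 planted extreme ones
runs = np.r_[rng.normal(250, 60, 85), [870, 880, 900, 910, 950]]
s = pd.Series(runs, name='Runs')

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lo) | (s > hi)]
print("outlier count:", len(outliers))
```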
# Lets check the correlation of each column
sns.pairplot(cricket, diag_kind = 'kde')
# we can see that almost all the column pairs are positively related; there is no visible negative slope or correlation
cricket.corr()
# We can see from below that our data set is quite realistic:
# Runs is positively correlated with all the other features, meaning if a batsman scores more runs he is likely to have
# a better Average, Strike Rate, Fours, Sixes and Half Centuries
# similarly, if a batsman hits a half century, he is likely to have good numbers on Runs, Average, SR, Fours and Sixes
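For a quicker read of the correlation matrix, a heatmap could complement the raw numbers. A sketch with a toy frame using cricket-like column names (the real `cricket` frame would drop in directly):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
runs = rng.integers(50, 700, 90)
df = pd.DataFrame({
    'Runs':  runs,
    'Fours': runs // 10 + rng.integers(0, 5, 90),   # boundaries track runs closely
    'Sixes': runs // 40 + rng.integers(0, 3, 90),
})

sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.show()
```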
Since every column is a clear attribute of a good batsman, we should not neglect any of the columns.
We can assign each batsman a score of 0, 1, 2 or 4 on each column based on his performance,
then find the best one based on the total score earned by each batsman.
This way, a batsman is measured on every attribute: Runs, Average, Strike Rate, Fours, Sixes and Half Centuries.
#Creating new dataframe for score calculation
temp = cricket.copy()
temp1 = cricket.copy()
temp.info()
# Calculating scores based on the performance of each player
for columns in temp.columns[1:7]:
    #find 1st, 2nd and 3rd quartile
    q1 = temp[columns].quantile(0.25)
    q2 = temp[columns].quantile(0.50)
    q3 = temp[columns].quantile(0.75)
    print("quartile values for", columns, "are", q1, q2, q3)
    #replace each value with a score of 0, 1, 2 or 4 based on the criteria below
    temp.loc[(temp[columns].values < q1) , columns] = 0
    temp.loc[(temp[columns].values >= q1) & (temp[columns].values < q2) , columns] = 1
    temp.loc[(temp[columns].values >= q2) & (temp[columns].values <= q3) , columns] = 2
    temp.loc[(temp[columns].values > q3) , columns] = 4
#Calculate the total score across the six scored columns (excluding the Name column)
temp['Score'] = temp[temp.columns[1:7]].sum(axis = 1)
# Take a look at the data
temp.head(10)
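The same quartile scoring could be written more concisely with `pd.cut`. A sketch on a toy column (one minor edge difference: with left-closed bins a value exactly equal to q3 lands in the top bin, whereas the loop above scores it 2):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Runs': rng.integers(0, 700, 90)})  # toy stand-in column

q1, q2, q3 = df['Runs'].quantile([0.25, 0.50, 0.75])
df['Runs_score'] = pd.cut(
    df['Runs'],
    bins=[-np.inf, q1, q2, q3, np.inf],
    labels=[0, 1, 2, 4],
    right=False,           # left-closed bins: [q1, q2) scores 1, etc.
).astype(int)
print(df['Runs_score'].value_counts().sort_index())
```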
#Check the batsman with same scores
temp['Score'].value_counts()
#Copy the score column to the original dataset before assigning Ranks
cricket['Score'] = temp['Score']
cricket.head(20)
#Sort the batsmen by highest score; for batsmen with the same score,
# the one with the higher Average will precede the other
cricket.sort_values(['Score','Ave'] ,ascending=[False,False], inplace= True)
cricket.head(20)
#Assigning the Rank to all 90 batsmen
cricket['Rank'] = range(1,91)
cricket
temp1.drop(labels ='Name', axis = 1, inplace = True)
XScaled = temp1.apply(zscore)
XScaled.head()
# Creating single dimension which will represent all the features using PCA technique
pca3 = PCA(n_components=1)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
Xpca3
temp1['Name'] = temp['Name']
temp1['PCA dimension'] = Xpca3
temp1.head()
temp1.sort_values(['PCA dimension'] ,ascending=[False], inplace= True)
temp1.head(20)
temp1['Rank'] = range(1,91)
temp1
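To check how closely the two rankings (quartile score vs single PCA dimension) agree, a Spearman rank correlation could be computed after merging on player name. A sketch with made-up four-player tables (the real `cricket` and `temp1` rank tables would slot in the same way):

```python
import pandas as pd
from scipy.stats import spearmanr

# hypothetical rank tables from the two approaches
score_rank = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Rank': [1, 2, 3, 4]})
pca_rank   = pd.DataFrame({'Name': ['A', 'C', 'B', 'D'], 'Rank': [1, 2, 3, 4]})

merged = score_rank.merge(pca_rank, on='Name', suffixes=('_score', '_pca'))
rho, p = spearmanr(merged['Rank_score'], merged['Rank_pca'])
print(f"Spearman rank correlation: {rho:.2f}")
```

A rho close to 1 would mean both ranking schemes tell essentially the same story.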
Question 1: List down all possible dimensionality reduction techniques that can be implemented using python.
Answer: Based on the research made to find more dimension reduction techniques, the following can be implemented in Python (most via scikit-learn): Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), Factor Analysis, Kernel PCA, Truncated SVD, and t-SNE.
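As a taste of one of these, Independent Component Analysis separates a multivariate signal into statistically independent sources. A minimal sketch with scikit-learn's `FastICA` on two synthetic mixed signals (unrelated to the cricket data):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)               # sinusoidal source
s2 = np.sign(np.sin(3 * t))      # square-wave source
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],        # mixing matrix
              [0.5, 2.0]])
X = S @ A.T                      # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated independent sources
print(S_est.shape)
```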
Answer: Yes, it is possible to reduce dimensions on multimedia data as well. For implementation, let's try the PCA approach on the built-in Iris data set.
# The goal of this module is to verify the implementation of PCA on multimedia data, hence we will purely concentrate on
# verifying the model's performance before and after applying PCA on the iris flower data set.
#loading iris flower data set
iris = pd.read_csv('iris-1.csv')
iris.head()
iris.shape
iris.isnull().sum()
iris.describe()
iris.info()
#Lets split the data into x and y data set
x = iris.drop('Species', axis = 1)
y = iris['Species']
x.shape,y.shape
#applying standard scaling on x data before splitting into training and testing data
#Before we start creating clusters, lets scale the data
x_sc = x.apply(zscore)
x_sc.head()
# Lets split the data into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x_sc, y, test_size = 0.3, random_state=1)
#Lets create KNN to predict the Species
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train,y_train)
print("Train score before PCA", knn.score(x_train,y_train)*100, "%")
print("Test score before PCA", knn.score(x_test,y_test)*100, "%")
# From the below accuracy it is clear that our data set is very well organized and distributed
# the species of one kind lie close to one another, thereby allowing the model to predict accurately with different k values
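Since different k values can change the picture, a quick cross-validated sweep helps justify the choice of `n_neighbors=4`. A sketch using the iris data bundled with scikit-learn (the CSV loaded above would work the same way):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (3, 4, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"k={k}: mean accuracy {scores.mean():.3f}")
```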
# Now lets try reducing the dimension using PCA and find out the difference in the model performance
from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(x)
# Lets find the covariance of all dimensions
pca.get_covariance()
# from below we can clearly see that the vast majority of the variance (over 92%) is explained by the first dimension
pca.explained_variance_ratio_
#Since our data is good and all the species are very well organized,
# it justifies that PCA's single dimension can explain most of the data set.
# So lets try and create single dimension data
pca_1=PCA(n_components=1)
x_pca1=pca_1.fit_transform(x)
# Lets split the data into training and testing data using new PCA dimension
x_train, x_test, y_train, y_test = train_test_split(x_pca1, y, test_size = 0.3, random_state=1)
#Lets create KNN to predict the Species
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(x_train,y_train)
print("Train score after PCA", knn.score(x_train,y_train)*100, "%")
print("Test score after PCA", knn.score(x_test,y_test)*100, "%")
# Even after reducing to a single PCA dimension, the accuracy remains close to the pre-PCA model,
# confirming that one principal component captures most of the information needed to separate the species
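The before/after comparison could also be packaged as a single scikit-learn pipeline, which keeps scaling and PCA fitted only on the training fold (avoiding leakage). A sketch with the bundled iris data, mirroring the steps above:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# scale -> reduce to 1 principal component -> classify, all in one estimator
pipe = make_pipeline(StandardScaler(), PCA(n_components=1),
                     KNeighborsClassifier(n_neighbors=4))
pipe.fit(X_tr, y_tr)
print(f"Test accuracy with 1 PCA component: {pipe.score(X_te, y_te):.3f}")
```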